Skip to content

Conversation

@emkornfield
Copy link
Collaborator

@emkornfield emkornfield commented Sep 26, 2025

🥞 Stacked PR

Use this link to review incremental changes.


This PR addes the ability to remove files in a Transaction.

To accomplish this we re-use FilteredEngineData returned in the scan plan to remove file. In order to not drop fields some additional metadata is not passed through via "fileLevelConstants" struct, and transformed back. This does not add FFI Bindings yet.

IMPORTANT: One edge case this introduced is the ability to potentially drop stats if a table only has parsed_stats available in checkpoints, as kernel does not currently read them. We should discuss if we are OK to merge given this constraint.

Testing: Add unit tests to ensure serialized fields are round-tripped and that scan planning ignores the files.

BREAKING_CHANGE: Schema for scan rows is changed but this should mostly be backwards compatible.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 29, 2025
@emkornfield emkornfield force-pushed the stack/remove_files branch 3 times, most recently from 45fcb18 to d59a24e Compare October 9, 2025 21:21
@emkornfield emkornfield force-pushed the stack/remove_files branch 3 times, most recently from 52fe8f1 to b31735d Compare October 15, 2025 18:38
@codecov
Copy link

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.00%. Comparing base (87e3552) to head (fdd4700).

Files with missing lines Patch % Lines
test-utils/src/lib.rs 66.66% 0 Missing and 5 partials ⚠️
kernel/src/transaction/mod.rs 94.87% 0 Missing and 4 partials ⚠️
...src/engine/arrow_expression/evaluate_expression.rs 0.00% 3 Missing ⚠️
kernel/src/actions/visitors.rs 66.66% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
+ Coverage   84.96%   85.00%   +0.04%     
==========================================
  Files         114      114              
  Lines       29445    29571     +126     
  Branches    29445    29571     +126     
==========================================
+ Hits        25017    25137     +120     
+ Misses       3220     3216       -4     
- Partials     1208     1218      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@emkornfield emkornfield requested review from zachschuermann and removed request for zachschuermann October 15, 2025 18:56
emkornfield added a commit that referenced this pull request Oct 15, 2025
## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1281/files) to
review incremental changes.
-
[**stack/with_data_change**](#1281)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1281/files)]
-
[stack/add_stats_to_remove](#1390)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1390/files/a9159f03f6260388d7bbf9f4fd2ca06f8db21f8f..7637fb90686733d81815f4fae5f9e47880f55a4f)]
-
[stack/remove_files](#1353)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1353/files/7637fb90686733d81815f4fae5f9e47880f55a4f..94e2108cbea608b98284c634807b4f265cddfa29)]

---------
This PR adds a with_data_change method to apply whether a transaction
represents a data change to all the file additions (and in the future
removals). This helps prevents clients from making accidentally mixing
files marked with dataChange = true or false, and is in the long term
the direction we want to go in.

In the future we will also like want to physically record dataChange at
the commit level, and move away from per file denotation.

Some implications of this change:

* The schema of add_files is now dependent on whether this flag is sent
(this is to allow connectors to maintain backwards compatibility).
* Change the default engine to no longer pass through dataChange at the
file level but instead require using the new setter.

Testing:
1.  Existing FFI tests verify backwards compatibility.
2. Write tests now call this method with the default engine and no
change of output happens.
3. An additional test is added to verify setting the flag to false,
writes the appropriate value on add actions.

BREAKING CHANGE: Engines must use with_data_change on the transaction
level instead of passing it to the method. `add_files_schema` is moved
to be scoped on a the transaction.
@emkornfield emkornfield marked this pull request as ready for review October 15, 2025 20:19
@emkornfield emkornfield changed the title feat: add remove_files API feat!: add remove_files API Oct 15, 2025
.with_inserted_field(
// extended_file_metadata
Some("path"),
Expression::literal(true).into(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to figure out if this needs to be adaptive when the fields it is true for are null, or if hard-coded true is sufficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Java kernel also hard-codes this.

samansmink pushed a commit to samansmink/delta-kernel-rs that referenced this pull request Oct 19, 2025
## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1281/files) to
review incremental changes.
-
[**stack/with_data_change**](delta-io#1281)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1281/files)]
-
[stack/add_stats_to_remove](delta-io#1390)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1390/files/a9159f03f6260388d7bbf9f4fd2ca06f8db21f8f..7637fb90686733d81815f4fae5f9e47880f55a4f)]
-
[stack/remove_files](delta-io#1353)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1353/files/7637fb90686733d81815f4fae5f9e47880f55a4f..94e2108cbea608b98284c634807b4f265cddfa29)]

---------
This PR adds a with_data_change method to apply whether a transaction
represents a data change to all the file additions (and in the future
removals). This helps prevents clients from making accidentally mixing
files marked with dataChange = true or false, and is in the long term
the direction we want to go in.

In the future we will also like want to physically record dataChange at
the commit level, and move away from per file denotation.

Some implications of this change:

* The schema of add_files is now dependent on whether this flag is sent
(this is to allow connectors to maintain backwards compatibility).
* Change the default engine to no longer pass through dataChange at the
file level but instead require using the new setter.

Testing:
1.  Existing FFI tests verify backwards compatibility.
2. Write tests now call this method with the default engine and no
change of output happens.
3. An additional test is added to verify setting the flag to false,
writes the appropriate value on add actions.

BREAKING CHANGE: Engines must use with_data_change on the transaction
level instead of passing it to the method. `add_files_schema` is moved
to be scoped on a the transaction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants